Preprocess

A. Observe & Clean Datasets

  1. Observe data types and overall summary of the datasets

  2. Check for duplicate values and remove any that are found

  3. Eliminate matches that have not started yet, since they cannot be included in the analysis

  4. Eliminate the matches in the Odds table that are not contained in the remaining Matches dataset

  5. Convert Unix epoch time to date

  6. Acquire the final odds for each bookmaker for each betType - totalhandicap pair
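
Steps 2 and 4 to 6 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the row layout, the names `odds_rows`, `kept_match_ids`, `epoch_to_date` and `final_odds`, the timestamps being in seconds, and the reading of "final odds" as the most recently recorded odd are all hypothetical and not taken from the original analysis.

```python
from datetime import datetime, timezone

# Hypothetical odds rows:
# (match_id, bookmaker, bet_type, total_handicap, epoch_seconds, odd)
odds_rows = [
    ("m1", "Pinnacle", "ou", "2.5", 1546300800, 1.90),
    ("m1", "Pinnacle", "ou", "2.5", 1546304400, 1.85),  # later update
    ("m1", "Pinnacle", "ou", "2.5", 1546304400, 1.85),  # exact duplicate row
    ("m2", "Pinnacle", "ou", "2.5", 1546300800, 2.10),  # match not kept
]
kept_match_ids = {"m1"}  # assumed: ids remaining in the cleaned Matches data


def epoch_to_date(epoch_seconds):
    """Convert Unix epoch seconds to a UTC calendar date."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()


def final_odds(rows, valid_ids):
    """Keep the latest odd per (match, bookmaker, betType, totalhandicap)."""
    latest = {}
    for row in set(rows):  # set() drops exact duplicate rows
        match_id, bookmaker, bet_type, handicap, ts, odd = row
        if match_id not in valid_ids:
            continue  # odds for matches that were removed from Matches
        key = (match_id, bookmaker, bet_type, handicap)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, odd)
    return {key: (epoch_to_date(ts), odd) for key, (ts, odd) in latest.items()}


result = final_odds(odds_rows, kept_match_ids)
```

With this toy input, `result` holds a single entry for match `m1`, carrying the later of its two distinct odds.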

B. Bookmaker Selection

In this part I aim to identify the bookmakers that will be included in my analysis. To do so, I follow these steps:

  1. Calculate the number of betType - totalhandicap pairs provided by each bookmaker

  2. Then check the common match counts for different sets of bookmakers

## [1] "851 Common Matches for 5 number of selected bookmakers, which are:"
## [1] "1xBet, Unibet, Pinnacle, Betfair Exchange, 10Bet"
## [1] "-----------------------------------------------------------------"
## [1] "851 Common Matches for 6 number of selected bookmakers, which are:"
## [1] "1xBet, Unibet, Pinnacle, Betfair Exchange, 10Bet, BetVictor"
## [1] "-----------------------------------------------------------------"
## [1] "851 Common Matches for 7 number of selected bookmakers, which are:"
## [1] "1xBet, Unibet, Pinnacle, Betfair Exchange, 10Bet, BetVictor, bet365"
## [1] "-----------------------------------------------------------------"
## [1] "850 Common Matches for 8 number of selected bookmakers, which are:"
## [1] "1xBet, Unibet, Pinnacle, Betfair Exchange, 10Bet, BetVictor, bet365, ComeOn"
## [1] "-----------------------------------------------------------------"
## [1] "821 Common Matches for 9 number of selected bookmakers, which are:"
## [1] "1xBet, Unibet, Pinnacle, Betfair Exchange, 10Bet, BetVictor, bet365, ComeOn, 12BET"
## [1] "-----------------------------------------------------------------"
## [1] "821 Common Matches for 10 number of selected bookmakers, which are:"
## [1] "1xBet, Unibet, Pinnacle, Betfair Exchange, 10Bet, BetVictor, bet365, ComeOn, 12BET, 188BET"
## [1] "-----------------------------------------------------------------"

As a result, I decided to conduct my analysis with 8 bookmakers, since moving from 7 to 8 bookmakers costs only one common match (851 to 850), whereas adding a ninth drops the count to 821.
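
The common-match count for a growing set of bookmakers is a running set intersection, which can be sketched in Python. This is a minimal sketch, not the code that produced the output above: `matches_by_bookmaker`, the toy match ids, and the rule of adding bookmakers in order of coverage are assumptions made for illustration.

```python
# Hypothetical mapping: bookmaker -> ids of matches for which it offers odds
matches_by_bookmaker = {
    "1xBet": {"m1", "m2", "m3", "m4"},
    "Unibet": {"m1", "m2", "m3", "m4"},
    "Pinnacle": {"m1", "m2", "m3"},
    "ComeOn": {"m1", "m2"},
}


def common_match_counts(matches_by_bookmaker):
    """Add bookmakers by descending coverage; count matches shared by the first k."""
    ordered = sorted(
        matches_by_bookmaker,
        key=lambda name: len(matches_by_bookmaker[name]),
        reverse=True,
    )
    counts = []
    common = None
    for k, name in enumerate(ordered, start=1):
        matches = matches_by_bookmaker[name]
        common = set(matches) if common is None else common & matches
        counts.append((k, ordered[:k], len(common)))
    return counts


for k, names, count in common_match_counts(matches_by_bookmaker):
    print(f"{count} common matches for {k} bookmakers: {', '.join(names)}")
```

Because each step intersects with one more set, the count can only stay flat or fall as bookmakers are added, which is the pattern visible in the output above.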

C. betType - totalhandicap pair Selection

The objective of this part follows the same idea as the previous part.

  1. Calculate the common match count of betType - totalhandicap pairs for the selected 8 bookmakers

  2. Then check the common complete cases for different sets of betType - totalhandicap pairs

## [1] "781 Comple Cases for total number of 1betType-totalhandicap pair"
## [1] "-----------------------------------------------------------------"
## [1] "392 Comple Cases for total number of 2betType-totalhandicap pair"
## [1] "-----------------------------------------------------------------"
## [1] "74 Comple Cases for total number of 3betType-totalhandicap pair"
## [1] "-----------------------------------------------------------------"
## [1] "9 Comple Cases for total number of 4betType-totalhandicap pair"
## [1] "-----------------------------------------------------------------"
## [1] "1 Comple Cases for total number of 5betType-totalhandicap pair"
## [1] "-----------------------------------------------------------------"
## [1] "0 Comple Cases for total number of 6betType-totalhandicap pair"
## [1] "-----------------------------------------------------------------"
## [1] "0 Comple Cases for total number of 7betType-totalhandicap pair"
## [1] "-----------------------------------------------------------------"

As a result, I observe a sharp drop in complete cases, from 392 to 74, when the number of selected pairs increases from 2 to 3. However, selecting 3 pairs yields a higher-dimensional feature set, which is useful for examining PCA in this assignment (and the additional betType may also help in classifying the game results).
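
A complete case here can be read as a match for which every selected bookmaker offers every selected pair. The sketch below counts such matches in Python. The availability set, the bookmaker names `A` and `B`, and the pair labels are hypothetical placeholders, not values from the dataset.

```python
# Hypothetical availability records: (match_id, bookmaker, pair_label)
available = {
    ("m1", "A", "1x2"), ("m1", "B", "1x2"),
    ("m1", "A", "ou 2.5"), ("m1", "B", "ou 2.5"),
    ("m1", "A", "ah -1"), ("m1", "B", "ah -1"),
    ("m2", "A", "1x2"), ("m2", "B", "1x2"),
    ("m2", "A", "ou 2.5"), ("m2", "B", "ou 2.5"),
    ("m2", "A", "ah -1"),  # bookmaker B is missing this pair for m2
    ("m3", "A", "1x2"), ("m3", "B", "1x2"),
}


def complete_cases(available, bookmakers, pairs):
    """Count matches where every bookmaker offers every selected pair."""
    match_ids = {match_id for match_id, _, _ in available}
    return sum(
        all((m, b, p) in available for b in bookmakers for p in pairs)
        for m in match_ids
    )


bookmakers = ["A", "B"]
for pairs in (["1x2"], ["1x2", "ou 2.5"], ["1x2", "ou 2.5", "ah -1"]):
    print(complete_cases(available, bookmakers, pairs), "complete cases for", pairs)
```

Each additional pair adds a further condition that every bookmaker must satisfy, so the count shrinks quickly as pairs are added.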

D. Finalize and Spread the Odds Dataset

Based on the bookmaker and betType - totalhandicap pair selection, I decided to conduct my analysis with two alternatives, in order to observe different situations:

  1. 8 bookmakers and 2 betType - totalhandicap pairs
  2. 8 bookmakers and 3 betType - totalhandicap pairs

E. Calculate Match Results

  1. Determine the total score of each match by extracting the individual (home and away) scores from the score column and summing them

  2. Flag a match as “over” if its total score exceeds the handicap of 2.5

  3. Flag match results according to home & away scores

  4. Merge the necessary info from the Matches dataset with the Odds dataset
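
Steps 1 to 3 above can be sketched in Python as below. The sketch assumes that the score column holds strings such as `"2:1"` (home goals first) and that results are labelled `home`, `draw` and `away`. Both the format and the function name `match_outcome` are assumptions for illustration.

```python
def match_outcome(score, handicap=2.5):
    """Return (total goals, over/under flag, result flag) for a 'home:away' score."""
    home, away = (int(part) for part in score.split(":"))
    total = home + away
    over_under = "over" if total > handicap else "under"
    if home > away:
        result = "home"
    elif home < away:
        result = "away"
    else:
        result = "draw"
    return total, over_under, result


for score in ("2:1", "1:1", "0:3"):
    print(score, match_outcome(score))
```

Because the handicap is 2.5, a total can never equal it, so every match falls cleanly into either the over or the under class.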

Task 1-2 Results

PCA Analysis for 2 pairs

It seems the first 3 components explain about 90 percent of the variance. For a closer look, we can check the summary of the PCA.

## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4
## Standard deviation     4.6061660 3.3554988 1.90415854 1.27331560
## Proportion of Variance 0.5304191 0.2814843 0.09064549 0.04053332
## Cumulative Proportion  0.5304191 0.8119034 0.90254892 0.94308224
##                            Comp.5      Comp.6      Comp.7      Comp.8
## Standard deviation     0.70124015 0.539635163 0.500107037 0.465917413
## Proportion of Variance 0.01229344 0.007280153 0.006252676 0.005426976
## Cumulative Proportion  0.95537568 0.962655836 0.968908513 0.974335488
##                             Comp.9    Comp.10     Comp.11     Comp.12
## Standard deviation     0.434028412 0.39055037 0.332712254 0.308941728
## Proportion of Variance 0.004709517 0.00381324 0.002767436 0.002386125
## Cumulative Proportion  0.979045005 0.98285824 0.985625681 0.988011806
##                           Comp.13     Comp.14     Comp.15     Comp.16
## Standard deviation     0.29684274 0.264145474 0.230399814 0.221780304
## Proportion of Variance 0.00220289 0.001744321 0.001327102 0.001229663
## Cumulative Proportion  0.99021470 0.991959017 0.993286119 0.994515781
##                             Comp.17      Comp.18      Comp.19      Comp.20
## Standard deviation     0.1993536770 0.1574243161 0.1351686255 0.1327177731
## Proportion of Variance 0.0009935472 0.0006195604 0.0004567639 0.0004403502
## Cumulative Proportion  0.9955093284 0.9961288887 0.9965856527 0.9970260029
##                             Comp.21      Comp.22      Comp.23      Comp.24
## Standard deviation     0.1232792417 0.1165778140 0.1037946986 0.0920109137
## Proportion of Variance 0.0003799443 0.0003397597 0.0002693335 0.0002116502
## Cumulative Proportion  0.9974059471 0.9977457068 0.9980150403 0.9982266905
##                             Comp.25      Comp.26      Comp.27      Comp.28
## Standard deviation     0.0891023494 0.0848414018 0.0823485784 0.0799503217
## Proportion of Variance 0.0001984807 0.0001799516 0.0001695322 0.0001598013
## Cumulative Proportion  0.9984251712 0.9986051228 0.9987746550 0.9989344564
##                             Comp.29      Comp.30      Comp.31      Comp.32
## Standard deviation     0.0774394278 0.0734400156 0.0713270053 0.0664915296
## Proportion of Variance 0.0001499216 0.0001348359 0.0001271885 0.0001105281
## Cumulative Proportion  0.9990843780 0.9992192139 0.9993464024 0.9994569305
##                             Comp.33      Comp.34      Comp.35      Comp.36
## Standard deviation     0.0641875698 6.188490e-02 5.749048e-02 5.626759e-02
## Proportion of Variance 0.0001030011 9.574353e-05 8.262888e-05 7.915106e-05
## Cumulative Proportion  0.9995599316 9.996557e-01 9.997383e-01 9.998175e-01
##                             Comp.37      Comp.38      Comp.39      Comp.40
## Standard deviation     5.067750e-02 4.684621e-02 0.0413909649 2.873687e-02
## Proportion of Variance 6.420524e-05 5.486418e-05 0.0000428303 2.064519e-05
## Cumulative Proportion  9.998817e-01 9.999365e-01 0.9999793548 1.000000e+00
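
The proportion rows in the summary above follow from the standard deviations: each component's variance is its squared standard deviation, divided by the sum over all components. The Python sketch below re-derives them from the printed values. The helper `components_needed` is a hypothetical name introduced here for illustration.

```python
# Standard deviations copied from the "Importance of components" summary above
sdev = [
    4.6061660, 3.3554988, 1.90415854, 1.27331560,
    0.70124015, 0.539635163, 0.500107037, 0.465917413,
    0.434028412, 0.39055037, 0.332712254, 0.308941728,
    0.29684274, 0.264145474, 0.230399814, 0.221780304,
    0.1993536770, 0.1574243161, 0.1351686255, 0.1327177731,
    0.1232792417, 0.1165778140, 0.1037946986, 0.0920109137,
    0.0891023494, 0.0848414018, 0.0823485784, 0.0799503217,
    0.0774394278, 0.0734400156, 0.0713270053, 0.0664915296,
    0.0641875698, 6.188490e-02, 5.749048e-02, 5.626759e-02,
    5.067750e-02, 4.684621e-02, 0.0413909649, 2.873687e-02,
]

variances = [s ** 2 for s in sdev]
total_variance = sum(variances)
proportion = [v / total_variance for v in variances]

cumulative = []
running = 0.0
for share in proportion:
    running += share
    cumulative.append(running)


def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return k
    return len(cumulative)


print(round(total_variance, 3), round(cumulative[2], 4))
print(components_needed(cumulative, 0.90))
```

The total variance comes out very close to 40. That would be consistent with a PCA run on the correlation matrix of 40 standardized features, although the report does not state this explicitly.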

When we look at the separation between the classes, the selected components (existing features) are not capable of a good classification of over and under results. However, we can observe slightly better separation for game results, especially for away results.

MDS Analysis for 2 pairs

MDS analysis is conducted with the following steps:

  1. Calculate Manhattan Distance

  2. Calculate Euclidean Distance

  3. Create MDS for each distance
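
The two distance measures in steps 1 and 2 can be sketched in Python as below. The sketch covers only the pairwise distance matrices that an MDS routine would take as input, not the MDS embedding itself. The feature rows and function names are hypothetical.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(rows, metric):
    """Symmetric matrix of pairwise distances between feature rows."""
    return [[metric(a, b) for b in rows] for a in rows]


# Hypothetical feature rows (e.g. odds for two matches)
rows = [[1.0, 2.0], [4.0, 6.0]]
print(distance_matrix(rows, manhattan))
print(distance_matrix(rows, euclidean))
```

For these two rows the Manhattan distance is 7 and the Euclidean distance is 5, which shows how the same pair of observations can sit at different relative distances under the two measures.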

We observe very similar results with PCA and MDS (Manhattan). The result of MDS (Euclidean) may seem different, but I would say there is no difference in separating the classes, for both over/under and game result.

PCA Analysis for 3 pairs

## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4    Comp.5
## Standard deviation     4.2453141 3.0744946 2.3167373 1.64341761 0.9481625
## Proportion of Variance 0.4505673 0.2363129 0.1341818 0.06752054 0.0224753
## Cumulative Proportion  0.4505673 0.6868802 0.8210620 0.88858254 0.9110578
##                            Comp.6    Comp.7     Comp.8      Comp.9
## Standard deviation     0.81582174 0.8024985 0.67551035 0.589162653
## Proportion of Variance 0.01663913 0.0161001 0.01140786 0.008677816
## Cumulative Proportion  0.92769697 0.9437971 0.95520492 0.963882740
##                            Comp.10     Comp.11    Comp.12     Comp.13
## Standard deviation     0.508878494 0.486992769 0.45528365 0.381929032
## Proportion of Variance 0.006473933 0.005929049 0.00518208 0.003646745
## Cumulative Proportion  0.970356673 0.976285722 0.98146780 0.985114546
##                            Comp.14    Comp.15     Comp.16     Comp.17
## Standard deviation     0.344001592 0.32308390 0.277090833 0.230970732
## Proportion of Variance 0.002958427 0.00260958 0.001919483 0.001333687
## Cumulative Proportion  0.988072974 0.99068255 0.992602037 0.993935724
##                            Comp.18      Comp.19      Comp.20      Comp.21
## Standard deviation     0.202268799 0.1860154757 0.1691343487 0.1555592152
## Proportion of Variance 0.001022817 0.0008650439 0.0007151607 0.0006049667
## Cumulative Proportion  0.994958541 0.9958235845 0.9965387452 0.9971437120
##                             Comp.22      Comp.23      Comp.24      Comp.25
## Standard deviation     0.1393144398 0.1221515259 0.1093819489 0.1062729872
## Proportion of Variance 0.0004852128 0.0003730249 0.0002991103 0.0002823487
## Cumulative Proportion  0.9976289248 0.9980019497 0.9983010600 0.9985834087
##                             Comp.26      Comp.27      Comp.28      Comp.29
## Standard deviation     0.0970954774 0.0953548351 0.0785863748 0.0727628323
## Proportion of Variance 0.0002356883 0.0002273136 0.0001543955 0.0001323607
## Cumulative Proportion  0.9988190969 0.9990464106 0.9992008060 0.9993331668
##                             Comp.30      Comp.31      Comp.32      Comp.33
## Standard deviation     0.0709612563 0.0632535037 6.042410e-02 5.632042e-02
## Proportion of Variance 0.0001258875 0.0001000251 9.127681e-05 7.929973e-05
## Cumulative Proportion  0.9994590543 0.9995590794 9.996504e-01 9.997297e-01
##                             Comp.34      Comp.35      Comp.36      Comp.37
## Standard deviation     5.137710e-02 0.0491123220 4.432768e-02 3.785539e-02
## Proportion of Variance 6.599017e-05 0.0000603005 4.912358e-05 3.582577e-05
## Cumulative Proportion  9.997956e-01 0.9998559466 9.999051e-01 9.999409e-01
##                             Comp.38      Comp.39      Comp.40
## Standard deviation     3.422399e-02 3.234913e-02 1.210015e-02
## Proportion of Variance 2.928203e-05 2.616166e-05 3.660342e-06
## Cumulative Proportion  9.999702e-01 9.999963e-01 1.000000e+00

MDS Analysis for 3 pairs

For this part, my comments are the same in general. However, I can add that the 3rd pair, which is an Asian handicap, might be better at separating the end classes, because the graphs seem more separated than the previous ones. Yet I have to say that this might be an illusion due to the smaller number of observations.

Task 3

1 - Read Image

2 - The structure of the image is numeric

##  num [1:512, 1:512, 1:3] 0.11 0.114 0.133 0.188 0.239 ...

And the size of the image is:

## [1] 512 512   3

This is because we resized the image to 512 x 512 pixels and it is an RGB color image.

And the image is the following:

And the RGB channels are the following:

3 - Noisy Image:
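
One way a noisy version of such an image could be produced is sketched below in Python. The report does not say which noise was used, so the uniform distribution, its bounds, the clipping to [0, 1], and the tiny 2 x 2 x 3 example image are all assumptions made purely for illustration.

```python
import random


def add_uniform_noise(image, low=0.0, high=0.1, seed=0):
    """Add uniform noise to every channel value and clip the result to [0, 1]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        [
            [min(1.0, max(0.0, value + rng.uniform(low, high))) for value in pixel]
            for pixel in row
        ]
        for row in image
    ]


# Hypothetical 2 x 2 image with 3 channels (R, G, B), values in [0, 1]
image = [
    [[0.11, 0.20, 0.30], [0.50, 0.60, 0.70]],
    [[0.95, 0.98, 1.00], [0.00, 0.10, 0.20]],
]
noisy = add_uniform_noise(image)
```

Clipping keeps the noisy values inside the same [0, 1] range as the numeric image structure shown earlier, so the result can still be displayed as an image.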

And the RGB channels of the noisy image are the following: